Skip to content

ARROW-12028 ARROW-11940: [Rust][DataFusion] Add TimestampMillisecond support to GROUP BY/hash aggregates#9773

Closed
velvia wants to merge 5 commits into
apache:masterfrom
urbanlogiq:evan/pr-arrow-12028-groupby-tsmillis
Closed

ARROW-12028 ARROW-11940: [Rust][DataFusion] Add TimestampMillisecond support to GROUP BY/hash aggregates#9773
velvia wants to merge 5 commits into
apache:masterfrom
urbanlogiq:evan/pr-arrow-12028-groupby-tsmillis

Conversation

@velvia

@velvia velvia commented Mar 22, 2021

Copy link
Copy Markdown
Contributor

This PR adds support for TimestampMillisecondArray to Scalar, GroupByScalar, and hash_aggregate / GROUP BYs to Rust DataFusion, thus fixing 12028. I believe it might also fix 11940, though this needs input from you all.

There are some formatting changes from running cargo fmt.

@velvia

velvia commented Mar 22, 2021

Copy link
Copy Markdown
Contributor Author

@nevi-me @andygrove

@alamb

alamb commented Mar 23, 2021

Copy link
Copy Markdown
Contributor

Thanks @velvia -- I plan to review this tomorrow


use std::{convert::TryFrom, fmt, iter::repeat, sync::Arc};

use arrow::array::{

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These imports don't seem to be necessary? The module already has array::*, ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh they were reformatted by cargo fmt which is what the README told me to run.....

DataType::UInt64 => {
equal_rows_elem!(UInt64Array, l, r, left, right)
}
DataType::Boolean => equal_rows_elem!(BooleanArray, l, r, left, right),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any idea why the formatting changed? Do you use a recent/stable rust/cargo/rustfmt version?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I'm running, is it too old?
nightly-2020-11-25-x86_64-apple-darwin (default)

Should arrow project have a rust-toolchain to specify what version of Rust to run on?

@velvia velvia Mar 24, 2021

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I tried running cargo fmt with +stable on 1.50.0 and it did not generate any new diffs compared to what I have here....

@Dandandan

Copy link
Copy Markdown
Contributor

This looks good @velvia , I added some minor comments.

@kou kou changed the title [rust][datafusion] Fix 12028 and 11940: Add TimestampMillisecond support to GROUP BY/hash aggregates ARROW-12028 ARROW-11940: [Rust][DataFusion] Add TimestampMillisecond support to GROUP BY/hash aggregates Mar 24, 2021
@github-actions

Copy link
Copy Markdown

@alamb alamb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @velvia -- overall looks good.

The only thing I think is needed for this PR is tests for grouping on columns with TimestampMillisecond.

ScalarValue::UInt16(Some(v)) => GroupByScalar::UInt16(*v),
ScalarValue::UInt32(Some(v)) => GroupByScalar::UInt32(*v),
ScalarValue::UInt64(Some(v)) => GroupByScalar::UInt64(*v),
ScalarValue::TimeMillisecond(Some(v)) => GroupByScalar::TimeMillisecond(*v),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also add support for ScalarValue::TimeMicrosecond and ScalarValue::TimeNanosecond?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing, I wasn't sure why they weren't added here....

@velvia

velvia commented Mar 29, 2021

Copy link
Copy Markdown
Contributor Author

@alamb and others: have added a test and I believe addressed comments, hope this is all needed!

@alamb alamb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks @velvia !

@alamb

alamb commented Mar 30, 2021

Copy link
Copy Markdown
Contributor

@velvia the lint job seems to have failed -- looks like the tests need to get formatted

https://github.com/apache/arrow/pull/9773/checks?check_run_id=2221026494

I tried to fix this PR by checking in the result of running cargo fmt locally, but I don't have access to push to urbanlogiq/arrow

@velvia

velvia commented Mar 30, 2021

Copy link
Copy Markdown
Contributor Author

@alamb pushed the cargo fmt fix. :)

@velvia

velvia commented Mar 31, 2021

Copy link
Copy Markdown
Contributor Author

It seems the Docker build failed but I can't click on or expand on the details.

@alamb

alamb commented Mar 31, 2021

Copy link
Copy Markdown
Contributor

It seems the Docker build failed but I can't click on or expand on the details.

Yeah, that has been happening recently. I don't think it is related to this PR. I plan to merge this one as soon as the remaining tests pass

@alamb alamb closed this in af4d5f4 Mar 31, 2021
@velvia velvia deleted the evan/pr-arrow-12028-groupby-tsmillis branch April 6, 2021 18:03
alamb added a commit that referenced this pull request Apr 13, 2021
…s for Timestamp(_,_)

# Rationale:
If you try and aggregate (via SUM, for example) a column of a timestamp type, DataFusion generates an error:
```
Coercion from [Timestamp(Nanosecond, None)] to the signature Uniform(1, [Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64, Float32, Float64]) failed.
```

For example, from IOx

```
> show columns from t;
+---------------+--------------+------------+-------------+-----------------------------+-------------+
| table_catalog | table_schema | table_name | column_name | data_type                   | is_nullable |
+---------------+--------------+------------+-------------+-----------------------------+-------------+
| datafusion    | public       | t          | a           | Utf8                        | NO          |
| datafusion    | public       | t          | b           | Timestamp(Nanosecond, None) | NO          |
+---------------+--------------+------------+-------------+-----------------------------+-------------+
2 row in set. Query took 0 seconds.
> select sum(b) from t;
Plan("Coercion from [Timestamp(Nanosecond, None)] to the signature Uniform(1, [Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64, Float32, Float64]) failed.")
```

# Changes:
Add support for aggregating timestamp types and tests for same

# Notes
Note this is follow on / more fleshing out of the work done in #9773 by @velvia (👋  thanks for adding Timestamps to `ScalarValue`)

Supporting AVG on timestamps is tracked by https://issues.apache.org/jira/browse/ARROW-12318. It is more involved (as currently Avg assumes the output type is always F64), and not important for myuse case at the moment.

Closes #9970 from alamb/alamb/ARROW-12277-aggregate-timestamps

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants